spata-v2-genes-gene-sets-and-features.RmdGenes and gene-sets are of major importance when it comes to the analysis of every RNA sequencing results. The average count-matrix contains information of these expression levels for more than 20.000 genes. While referring to a few of them by entering their names manually might be tolerable things can get cumbersome rapidly once the number of genes of interest becomes double- or tripple digit. Still, the majority of functions in SPATA2 needs information about the genes, gene-sets and features of interest in form of character vectors. The following tutorial introduces important helper functions that make living easier when it comes to working in SPATA2 and working with these informative variables in general.
# load packages
library(SPATA2)
library(tidyverse)
# load object
spata_obj <- loadSpataObject("data/spata-obj-gbm275.RDS")The term gene-set refers to a collection of several genes that are known to interact with each other or being co-expressed together under certain conditions. Information about the gene-sets defined in your spata-object is stored in the slot @used_genesets which contains a data.frame of two character-variables.
In SPATA2 gene-set names are functionally divided into their class and their actual name. The class is determined by the character string that comes before the first *_*, the name refers to the rest. Subsequent chapters will highlight why this functional separation is useful as well as why it is useful to really think about the way you name your personally defined gene-sets. In order to obtain an overview of your saved gene-sets run printGeneSetOverview().
printGeneSetOverview(spata_obj)By default any initiateSpataObject_*()-function assigns a gene-set data.frame to your spata-object containing a variety of predefined gene-sets from the Molecular Signature Database. Run data(gsdf) to load and ?gsdf to get information about it and what the class-names stand for. If you want to add new ones you can do that via addGeneSet().
# add a new gene-set of three genes
spata_obj <-
addGeneSet(object = spata_obj,
class_name = "New",
gs_name = "example_1",
genes = c('SLC35E2A', 'NADK', 'GFAP')) # genes that exist in your expression matrix
printGeneSetOverview(spata_obj)A more convenient way would be to open an interactive application via addGeneSetsInteracive() as displayed below. Make sure to close the application via the respective button and to store the resulting object in the same object in order to overwrite it and not to overcrowd your global environment with spata-objects that only differ in a few gene-sets.
spata_obj <- addGeneSetsInteractive(spata_obj)
Figure 1.1 Interface of addGeneSetsInteractive()
The gene-set data.frame is infinitely expandable. If you want to discard gene-sets or alter them you can either use addGeneSet() with argument overwrite set to TRUE in order to overwrite existing ones or discardGeneSets() as shown below.
# discard gene-sets by name
spata_obj <- discardGeneSets(object = spata_obj, gs_names = c("New_example_2"))
# display updated overview
printGeneSetOverview(spata_obj)In order to save the gene-set data.frame of one particular spata-object for further use with other spata-objects use saveGeneSetDf() which saves the @used_genesets slot of the specified object as a .RDS-file.
saveGeneSetDf(object = spata_obj, directory = "data/gene-set-df.RDS") If you want to set your gene-set data.frame manually use setGeneSetDf().
Given the nature of your sample it will happen that your expression matrix does not contain genes (0 reads across all barcode-spots) although they appear in certain gene-sets. joinWithGeneSets() which works under the hood of a variety of functions does not allow for gene-sets to join your data if less than 25% of the genes needed to form that gene-set are found. If this threshold is to loose to you you can use adjustGeneSetDf() to set the limit yourself. It will then discard all gene-sets whoose belonging genes are found to a percentage lower than the threshold you set. (Discarding gene-sets from your gene-set data.frame effectively makes it impossible to work with them within SPATA2.)
getGenes() and getGeneSets() return character vectors which are valid inputs for all SPATA2-functions in which one has to refer to genes or gene-sets in any way.
As we have seen above gene-sets are a set of genes that are functionally gathered under a certain name. getGeneSets() does not return the genes of certain gene-sets but the gene-set-names as strings It does that either as a single character vector or as a named list.
# to get all gene-set names set 'of_class' to 'all'
all_gs <- getGeneSets(spata_obj, of_class = "all")
length(all_gs)## [1] 11665
head(all_gs, 10)## [1] "BC_RELA_PATHWAY" "BC_NO1_PATHWAY"
## [3] "BC_CSK_PATHWAY" "BC_SRCTMRPTP_PATHWAY"
## [5] "BC_AMI_PATHWAY" "BC_GRANULOCYTES_PATHWAY"
## [7] "BC_LYM_PATHWAY" "BC_ARAP_PATHWAY"
## [9] "BC_AGR_PATHWAY" "BC_AKAP95_PATHWAY"
# get gene-set names of certain classes
getGeneSets(spata_obj,
of_class = c("HM", "New"),
simplify = TRUE # as a single vector
) %>% tail() # show last six ## [1] "HM_ALLOGRAFT_REJECTION" "HM_SPERMATOGENESIS" "HM_KRAS_SIGNALING_UP"
## [4] "HM_KRAS_SIGNALING_DN" "HM_PANCREAS_BETA_CELLS" "New_example_1"
# get gene-set names of certain classes
getGeneSets(spata_obj,
of_class = c("HM", "New"),
simplify = FALSE # as a named list
) %>% lapply(head) #show first six## $HM
## [1] "HM_TNFA_SIGNALING_VIA_NFKB" "HM_HYPOXIA"
## [3] "HM_CHOLESTEROL_HOMEOSTASIS" "HM_MITOTIC_SPINDLE"
## [5] "HM_WNT_BETA_CATENIN_SIGNALING" "HM_TGF_BETA_SIGNALING"
##
## $New
## [1] "New_example_1"
To subset your gene-sets of interest additionally make use of the argument index which takes a character string as a regular expression and filters the gene-sets to be returned again according to it.
# get all gene-sets of class HALLMARK referring to something specific
getGeneSets(spata_obj,
of_class = "HM",
index = "ESTROGEN|ANDROGEN")## [1] "HM_ESTROGEN_RESPONSE_EARLY" "HM_ESTROGEN_RESPONSE_LATE"
## [3] "HM_ANDROGEN_RESPONSE"
If you want to skim your gene-sets interactively use getGeneSetsInteractive() which opens an application in your R-Studio’s viewer pane and returns a vector of chosen gene-sets.
Eventually expression matrices contain information about gene expression levels and not gene-set expression levels. If you are looking for genes of certain gene-sets getGenes() provides an easy we to do just that.
# getGenes() without any further specification will return all genes found
# in your expression matrix ...
all_genes <- getGenes(spata_obj)
# ... which can be quite a lot
length(all_genes)## [1] 21548
# with regards to gene-sets it makes more sense to break genes down by
# their gene-set belonging
estrogen_gs <-
getGeneSets(spata_obj,
of_class = "HM",
index = "ESTROGEN")
estrogen_genes <-
getGenes(spata_obj,
of_gene_sets = estrogen_gs,
simplify = FALSE) # to return them sorted in a list
#hint: "..." was added manually for visualization purpose as these gene-sets contain
# more than 150 genes ## $HM_ESTROGEN_RESPONSE_EARLY
## [1] "GREB1" "CA12" "SLC9A3R1" "MYB" "ANXA9" "IGFBP4"
## [7] "SYBU" "NPY1R" "PDZK1" "NRIP1" "MLPH" "HSPB8"
## [13] "EGR3" "KRT19" "LRIG1" "KDM4B" "PGR" "RHOBTB3"
## [19] "TPD52L1" "ELOVL2" "..."
##
## $HM_ESTROGEN_RESPONSE_LATE
## [1] "SLC9A3R1" "TPD52L1" "PRSS23" "CA12" "PDZK1" "ANXA9"
## [7] "CELSR2" "PGR" "RET" "MYB" "TPBG" "EGR3"
## [13] "ARL3" "OLFM1" "NPY1R" "SCNN1A" "XBP1" "AREG"
## [19] "IL17RB" "NRIP1" "..."
If you want to work with genes manually and skim them interactively use getGenesInteractive() which opens an application in your R-Studio’s viewer pane.
In SPATA2 features refer to all kinds of information of barcode-spots that do not fall explicitly under gene- or gene-set expression levels. This includes in particular group belonging such as clustering or spatial segmentation. But additional numeric features such as monocle3-pseudotime or everything else that has been computed is considered to be a feature as well. Feature information is stored - in a tidy data fashion - in the slot @fdata.
To obtain all feature information use getFeatureDf().
getFeatureDf(object = spata_obj)In order to specify features of interest as input in functions you need to specify the name that particular feature variable carries. Several functions only work with feature data of one kind - either numeric features or discrete/categorical features. The function getFeatureNames() provides you with information about all currently available features in your spata-object as well as about the class they belong to.
# get all feature names
getFeatureNames(object = spata_obj)## numeric integer numeric numeric
## "nCount_Spatial" "nFeature_Spatial" "percent.mt" "percent.RB"
## factor character character
## "seurat_clusters" "segmentation" "igraph_clusters"
# get only names of numeric features
numeric_features <-
getFeatureNames(object = spata_obj, of_class = c("numeric"))
# output
numeric_features## numeric numeric numeric
## "nCount_Spatial" "percent.mt" "percent.RB"
# get only names of discrete features
discrete_features <-
getFeatureNames(object = spata_obj, of_class = c("character", "factor"))
# output
discrete_features## factor character character
## "seurat_clusters" "segmentation" "igraph_clusters"
Additionally getFeatureValues() allows to quickly access the unique values of discrete feature variables such as clusters.
getFeatureValues(object = spata_obj, features = discrete_features)## $seurat_clusters
## [1] 0 2 1 3 5 6 4 7 8
## Levels: 0 1 2 3 4 5 6 7 8
##
## $segmentation
## [1] "none" "oxid-phosph-high"
##
## $igraph_clusters
## [1] "Cluster 2" "Cluster 6" "Cluster 1" "Cluster 5" "Cluster 4" "Cluster 3"
## [7] "Cluster 7"
When it comes to grouping variables such as clusters a more intuitive approach is offered by the two functions getGroupingOptions() and getGroupNames().
# obtain all variables that group the sample's barcode-spots in a certain way
getGroupingOptions(object = spata_obj)## factor character character
## "seurat_clusters" "segmentation" "igraph_clusters"
# obtain the names of the groups a certain grouping variable contains
getGroupNames(object = spata_obj, discrete_feature = "seurat_clusters")## [1] "0" "1" "2" "3" "4" "5" "6" "7" "8"
If you want to change the name of a feature variable you can do so with renameFeatures().
# old feature names
getFeatureNames(object = spata_obj)## numeric integer numeric numeric
## "nCount_Spatial" "nFeature_Spatial" "percent.mt" "percent.RB"
## factor character character
## "seurat_clusters" "segmentation" "igraph_clusters"
# change some names
spata_obj <- renameFeatures(object = spata_obj,
"seurat_cl" = "seurat_clusters",
"nCount" = "nCount_Spatial"
)
# new feature names
getFeatureNames(object = spata_obj)## numeric integer numeric numeric
## "nCount" "nFeature_Spatial" "percent.mt" "percent.RB"
## factor character character
## "seurat_cl" "segmentation" "igraph_clusters"
Referring to clusters/groups with numbers is often suboptimal. Once your differential gene expression analysis has given you insight in what constitutes certain group of barcode-spots you might want to give it a more informative name. You can do so with renameGroups().
# old group names
getGroupNames(object = spata_obj, discrete_feature = "seurat_cl")## [1] "0" "1" "2" "3" "4" "5" "6" "7" "8"
# rename cluster
spata_obj <- renameGroups(object = spata_obj,
discrete_feature = "seurat_cl",
"reactive-inflammatory" = "3",
"reactive-hypoxia" = "4"
)
# visualize renamed clustering
plotSurface(object = spata_obj, color_by = "seurat_cl")-1.png)
Whenever a warning message appears telling you that a certain gene, gene-set or feature was not found this means that the character string provided was not found throughout your spata-object. Check for typos in this case. The same is true for gene-set names with regards to the ont-variable in the gene-set data.frame.
The output vectors of getGenes(), getGeneSets() and getFeatureNames() are a perfectly valid inputs for arguments like color_by or variables.
hallmark_gs <- getGeneSets(object = spata_obj, of_class = "HM")
# output
hallmark_gs[1:10]## [1] "HM_TNFA_SIGNALING_VIA_NFKB" "HM_HYPOXIA"
## [3] "HM_CHOLESTEROL_HOMEOSTASIS" "HM_MITOTIC_SPINDLE"
## [5] "HM_WNT_BETA_CATENIN_SIGNALING" "HM_TGF_BETA_SIGNALING"
## [7] "HM_IL6_JAK_STAT3_SIGNALING" "HM_DNA_REPAIR"
## [9] "HM_G2M_CHECKPOINT" "HM_APOPTOSIS"
# create a spata-data.frame
joined_df <- joinWith(object = spata_obj,
spata_df = getCoordsDf(spata_obj),
gene_sets = hallmark_gs,
features = discrete_features)
#output
joined_dfWith our vector hallmark_gs containing our gene-sets of interest we can work throughout our R-session.
plotRidgeplot(object = spata_obj,
variables = hallmark_gs[1:8],
across = "seurat_cl",
across_subset = c("reactive-hypoxia", "reactive-inflammatory"),
ncol = 2
)
Figure 3.1. Example plot on how to use getGeneSets()